Description

FIELD OF THE INVENTION 

[0001] The invention relates to a method for preparing
diphones for text-to-speech generating by selectively
pronouncing a set of selected diphones, processing
each such diphone for persistent storage, and storing
each processed diphone in an individually addressable
storage location. Diphones have been found an excellent
vehicle for speech decomposition and representation.
For example, Patents US 4,813,074 to Marcus and
US 5,109,418 to Van Hemert, both assigned to the
present assignee, teach the extraction of segment
boundaries from spoken text. These methods are well
suited to constituting a diphone from a pair of
successive segmented elements of predetermined
lengths before and after the boundary, respectively. A
set of such diphones in turn allows for representing
speech in a robust manner. In particular, subsequent
speech rendering is elementary, because no complex
rules are necessary for realising the coarticulation
between contiguous phonemes. Once generated, the
diphones are stored in conventional storage for later
usage. The English language may be well represented by
using a 1500 diphone repertory. The document
Proceedings of Speech '88, pages 55-59, discloses a
multi-language speech synthesis system which is employed
in a text-to-speech system. The hardware realisation
employs two diphone dictionaries which are stored in
two ROMs, respectively. The present inventor has
encountered the need for representing multilingual speech
and in particular has found the required storage at 1500
locations for each language extremely burdensome,
especially in an environment where minimal cost is a
prerequisite. Such environments up to now usually have done
either without speech rendering, or with only a single
language, or with low-quality rendering methods, where
storage cost is often irrelevant.

SUMMARY OF THE INVENTION

[0002] Accordingly, amongst other things, it is an object
of the present invention to provide a method according
to the preamble that allows a diphone repertory to be
generated for multilingual speech rendering with reduced
storage capacity requirements. Now, according to
one of its aspects, the invention is characterized by
assigning, on a homophony basis, each diphone amongst
such a set as spoken by a single person to one or
more diverse languages, for diminishing storage
requirements through sharing of selective diphones amongst
more than one language. For example, six of the main
European languages can be collectively represented by
a set of only some 3000 diphones. This is a reduction
by a factor of three versus the sixfold enlargement that
a sixfold repetition of the 1500 diphones of each
particular single language would require. The invention relies
on the combined idea, first, that many languages, in
particular European languages, have identical or nearly
identical combinations of successive sounds, and
second, that certain persons have the ability to pronounce
such combinations in a way that would be appropriate
to several particular languages, even if the person in
question were not able to speak the language. Often,
such persons are multilingual. In fact, it would only be
necessary to pronounce a single combination of
two sounds correctly to the satisfaction of another person
who is able to speak the 'new' language. The more or less
identical pronunciation of a particular diphone in various
languages need not be based on correspondence in
orthography. On the other hand, identical orthography
need not go hand in hand with identical pronunciation.
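By way of illustration only, the storage figures given above can be checked with a few lines of Python. The function name `sharing_gain` and its parameter names are hypothetical and form no part of the disclosure; the numbers are the approximate ones stated in the text.

```python
def sharing_gain(n_languages, per_language, shared_total):
    """Compare separate per-language repertories with one shared repertory."""
    separate_total = n_languages * per_language   # storage without any sharing
    reduction = separate_total / shared_total     # saving obtained by sharing
    growth = shared_total / per_language          # size relative to one language
    return separate_total, reduction, growth


# Approximate figures from the text: six languages, 1500 diphones each,
# some 3000 diphones in the shared set.
print(sharing_gain(6, 1500, 3000))
```

With these figures, separate repertories would occupy 9000 locations, so the shared set is three times smaller while being only twice the size of a single-language repertory.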
[0003] The above implies that the reusable diphones 
are always produced by the same voice. On the other 
hand, it would be fully unacceptable if speech is con- 
structed from a sequence of diphones that are diverse 
in origin in that they have been generated by multiple 
persons. The storage proper can be effected in a soft- 
ware-controlled manner: it can be in the form of speech 
segments that are represented by an appropriate wave- 
form, or rather by parametrized data. The field of appli- 
cation of the invention can be various, such as traffic 
information systems, public transport enquiry systems, 
and various others. 

[0004] Advantageously, certain diphones are stored 
with processing qualifiers that discriminate processing 
operations on the diphone in question amongst such 
various languages sharing that diphone. Such extra 
processing qualifiers allow to further diminish storage 
requirements. In this way, also diphones that have par- 
ticular limited differences can be shared amongst vari- 
ous languages. Such differences may pertain to dura- 
tion of vowels, aspiration, or vowel quality. 
[0005] The invention also relates to a method for
text-to-speech transforming and to a device therefor. One
particular advantage of the invention is the ease with
which exonyms can be inserted into a spoken text. This
may, for example, be realized by inserting language
identifiers at the start and the termination of the exonym.
Further advantageous aspects of the invention are
recited in dependent Claims.
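The insertion of language identifiers around an exonym may be sketched as follows. This is a minimal illustration under stated assumptions: the tag syntax (`<FR>` to open, `</FR>` to close) and the function name `split_by_language` are hypothetical, since the description does not prescribe any particular form for the identifiers.

```python
import re

# Hypothetical markup: "<FR>" opens and "</FR>" closes a foreign-language run.
TAG = re.compile(r"<(/?)([A-Z]{2,3})>")


def split_by_language(text, default_lang):
    """Return (language, text) runs; tagged spans use the tagged language."""
    runs = []
    lang = default_lang
    pos = 0
    for match in TAG.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            runs.append((lang, chunk))
        closing, code = match.group(1), match.group(2)
        # A closing identifier returns to the language of the surrounding text.
        lang = default_lang if closing else code
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        runs.append((lang, tail))
    return runs
```

For instance, `split_by_language("The train to <FR>Lille</FR> departs", "ENG")` yields one English run, one French run for the exonym, and a final English run, each of which could then be converted to diphone codes for its own language while reading from the same shared storage.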

BRIEF DESCRIPTION OF THE DRAWING 

[0006] These and other aspects and advantages of 
the invention will be explained in detail hereinafter with 
reference to the disclosure of preferred embodiments, 
and in particular with reference to the appended Figures 
that show: 

Figure 1, a flow chart of the diphone preparing;
Figure 2, an apparatus for text-to-speech transfor- 
mation. 
DETAILED DESCRIPTION OF PREFERRED
EMBODIMENTS 

[0007] Figure 1 shows a flow chart of the diphone
preparing. In block 20, the system and user persons
prepare for the task. Appropriate storage and processing
facilities are allocated, and a list of diphones is set up.
In block 22, the first diphone is presented to the single
person that is going to pronounce all diphones of the
ultimate set. The presentation may be in visual or audio
form. In block 24, the first diphone is pronounced by the
single person. In block 26, the sound is checked for
correct pronunciation. This can be done in a subjective
manner by one or more reviewer persons, or alternatively,
by some kind of machine comparison with renderings
of the same diphone by a standard voice. The
pronunciation may be done in an isolated manner, as a
sequence of intendedly identical diphones, or in another
appropriate way. In case of plural utterances, the 'best'
one may be selected by the reviewer. If the pronunciation
is considered insufficient, the system goes back to
block 24 and requires another pronunciation of the same
diphone. The dialog may be supported by visual or audio
stimuli, in a manner known for human-machine dialogs
in general. For simplicity this is not detailed further.
[0008] If the pronunciation is considered sufficient, in
block 28 the sound is processed, such as clipped in time
domain, standardized in volume, filtered, if necessary
transformed into a set of parameters, and stored in a
conventional manner. For ultimate usage, this storage
must be random-access, but at this particular instant
such is not yet necessary. A diphone identifier must be
added in that case. In block 30 the system detects
whether it is ready; if not, it reverts to block 22 for the
next diphone. If yes, in block 32 the activities are closed,
such as by outputting the results, to wit a compact
set of diphones that is suitable for outputting of speech
in more than one language. Likewise, the output may
relate to an enhancing set of diphones for expanding an
n-language set to an (n+1)-language set.
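The loop of Figure 1 may be sketched as follows, purely by way of example. The callables `record`, `accept` and `process` stand in for blocks 24, 26 and 28 respectively and are hypothetical, as is the `max_attempts` limit, which the flow chart itself does not impose.

```python
def prepare_repertory(diphones, record, accept, process, max_attempts=5):
    """Follow blocks 22 to 32 of Figure 1 for every diphone in the list."""
    store = {}
    for diphone in diphones:                      # block 22: present next diphone
        for _ in range(max_attempts):
            sound = record(diphone)               # block 24: speaker pronounces it
            if accept(diphone, sound):            # block 26: pronunciation check
                store[diphone] = process(sound)   # block 28: process and store
                break
        else:
            # Assumed safeguard: give up after max_attempts rejected takes.
            raise RuntimeError(f"no acceptable take for {diphone!r}")
    return store                                  # block 32: output the set
```

Keying the store by diphone identifier corresponds to the requirement that each processed diphone be individually addressable; a rejected take simply causes the inner loop to return to block 24.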
[0009] For certain applications, the stored diphones 
may have processing qualifiers that discriminate 
processing operations on the diphone in question 
amongst such various languages sharing this diphone. 
Examples thereof will be discussed hereinafter. 
[0010] Figure 2 is a block diagram of an apparatus for 
text-to-speech transformation. Block 48 represents the 
storage of the multilingual set, for example realized in 
semiconductor technology. In the embodiment, the de- 
vice is used for four different languages, as symbolized 
by blocks 40-46. Diphone block 40 relates to language 
A, block 42 relates to language B, block 44 relates to 
language C, and block 46 relates to language D. In the 
principal storage, the topmost block therefore relates to 
languages A and B, the second block to languages A, 
C, and D, and so on. In the example, only the second-to-last
diphone relates solely to language C, one diphone
relates to all four languages, and the others to various
different combinations. The distribution depends on the
set of languages, and to a certain extent is open to
choice. Generally, the number of different diphones is a
matter of choice. In various cases, a particular diphone
may be reused among several languages, but the usage
of a unique diphone for each language would give a
slight improvement. The overall result is a matter of
trade-off.

[0011] Block 56 represents the input of a string of
diphone codes. These may be associated with a duration,
aspiration, prosody, or other processing qualifier
indication that need not be unique for the same diphone, either
when in different languages, or in various positions in a
sentence. By itself, differentiation of prosody versus the
position in a sentence has been disclosed in prior,
non-prepublished European Patent Application EP-A-0 774
152. The above disclosure, however, stores similar
speech items for one language in different ways. The
present invention goes the other way round by sharing
storage among various languages. Likewise, prosody
control signals may accompany the diphone string. Line
52 transfers the address code to the storage
organization 48. The latter may be location-based or
content-addressed, such as alphabetical. Bracket 50 symbolizes
the read-out mechanism of the storage proper. Block 58 
is the outputting mechanism of the storage, such as am- 
plifier or hold that is controlled by line 54, such as by a 
synchronizing signal. Block 60 effects the transforma- 
tion of the diphone for subsequent rendering, such as 
by windowing in a sequence of diphones, or by translat- 
ing a parametrized diphone to the 'real' sound. Block 62 
is the audio output mechanism, such as a combination 
of output amplifier and loudspeaker. Item 64 symbolizes 
the audio output that may be personal or public address. 
If necessary, speaker verification can be used to verify 
that all diphones so outputted indeed stem from a single 
speaker person. 
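The addressing of the shared storage 48 by diphone identifiers from several languages may be sketched as follows. This is an illustration only: the names `StoredDiphone`, `STORE`, `INDEX` and `read_out` are hypothetical, the store holds just two entries, and the language assignments are borrowed from the example tables of this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredDiphone:
    label: str
    languages: frozenset  # languages allowed to use this storage location


# Hypothetical contents of the shared storage.
STORE = [
    StoredDiphone("sp", frozenset({"NL", "ENG"})),
    StoredDiphone("id", frozenset({"NL", "DE", "FR"})),
]

# Every (language, diphone) identifier points at one shared location, so a
# location is responsive to identifiers from a plurality of languages.
INDEX = {
    (lang, unit.label): location
    for location, unit in enumerate(STORE)
    for lang in unit.languages
}


def read_out(language, codes):
    """Return the storage locations addressed by a string of diphone codes."""
    return [INDEX[(language, code)] for code in codes]
```

Here `read_out("NL", ["sp"])` and `read_out("ENG", ["sp"])` address the same location, whereas an identifier such as ("ENG", "id") is simply absent from the index and raises `KeyError`.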

FURTHER CONSIDERATIONS 

[0012] A reference for rendering speech in a multi-lan- 
guage environment can be found in R. Collier, Multi-lan- 
guage Intonation Synthesis, Journal of Phonetics 19 
(1991), 61-73. 

[0013] Further, the processing of the diphones has 
two aspects. In the first place, there is the processing 
on the level of the utterance, that is, the diphones are 
concatenated and subsequently, speech pitch and 
rhythm are added. In the second place, there is the 
processing on the level of the diphone itself, before it is 
entered into an utterance. The latter kind of processing 
should lead to further reduction of storage requirements. 
The following are a few examples: 

1. The English word 'mean', the Dutch word 'Mien'
(girl's name), and the French word 'mine' (mine,
such as for producing coal) share the vowel, the
only difference for near-perfect pronunciation being its
duration, which is longest in English, shortest in
French. The policy therefore is to use the same
combination of stored diphones (mi + in), but the
storage of the first diphone carries different duration
indications for each respective language that may
be in the range of 200-80 milliseconds.

2. Even in the same language, a particular diphone
may have to be rendered in different ways. For
example, the English word 'pin' has aspiration on the
sound 'p', which the word 'spin' has not. The same
diphone (pi) can be used in other languages, either
with, or without aspiration. A particular manner for
diminishing storage requirements is to store the
diphone as a waveform with aspiration, and to insert
marker indications at the start and at the end of the
aspiration sound. If required, the aspiration can
then be foregone. In both of the above cases, the
processing qualifiers apply differentially to the time
domain.

3. If the diphones are instead stored as parametric
information, the processing can be made differentially
in the spectral domain. In the Dutch word 'idee'
(idea), the second vowel has more or less the
character of a diphthong. In French, the word 'idée' does
not have this property. By itself, the lending of the
'diphthong' character can be effected by formant
manipulation, known by itself. The Dutch
representation of the diphone in question can now be
associated with a processing qualifier that controls this
manipulation, thereby necessitating the storage of
only one diphone. In a similar manner, the
diphthong aspect can have different gradations in
intensity, such as in the case of 'dose' (English), 'dozen'
(Dutch: boxes), and 'dose' (French: dose). It is well
known that the number of relevant formants is
small, and their frequencies can be stored as a
single quantity, or as a slope between two different
quantities. In the LPC-based approach, the spectral
manipulation can be combined with manipulation in
the time domain.

4. However, even without these differentiating
processing operations, the approach of the
invention remains valid, either with a slightly lower sound
quality, or with a somewhat larger storage. The
following examples are given, wherein standard
phonetic conventions have been used. For simplicity,
words have been used that have (nearly) the same
text representation in the various languages.
However, the same reusability principle applies when
the words are phonetically corresponding only in
part.



diphone number   representation   language
1                #s               NL, ENG
2                sp               NL, ENG
3                pi               NL, ENG
4                in               NL, ENG
5                n#               NL, ENG



[0014] In the above, # indicates silence at the
beginning and end, respectively, of the word in question. In
both languages, the combination of diphones
1+2+3+4+5 renders the word 'spin'.
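How the combination 1+2+3+4+5 yields the word may be sketched as follows. The function name `join_diphones` is hypothetical, and the sketch assumes for simplicity that each sound is written as a single character, which does not hold for all of the representations used in this description.

```python
def join_diphones(diphones):
    """Chain two-character diphones and return the resulting sound sequence."""
    sounds = [diphones[0][0]]
    for left, right in diphones:
        # Each diphone must start with the sound on which the previous one ended.
        if left != sounds[-1]:
            raise ValueError(f"{left + right!r} does not continue {sounds[-1]!r}")
        sounds.append(right)
    return "".join(sounds)


print(join_diphones(["#s", "sp", "pi", "in", "n#"]))  # prints "#spin#"
```

The overlap check reflects the fact that successive diphones share a sound at their junction, which is why no separate coarticulation rules are needed when they are concatenated.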



6                #pʰ              ENG, DE
7                pʰi              ENG, DE
8                ik(#)            ENG, DE



[0015] The sequence 6+7+8 renders the word 'pʰik'
(peak, Pik) in English and German, respectively.



9                #i               NL, DE, FR
10               id               NL, DE, FR
11               de(#)            NL, DE, FR
12               deʲ(#)           NL



[0016] The sequence 9+10+11 renders the word 'ide'
(German 'Idee', French 'idée'), the sequence 9+10+12
renders the word 'ideʲ' (Dutch 'idee', with a slight
diphthong aspect).



13               #f               NL, DE, FR
14               fa               NL, DE, FR
15               at               NL, DE, FR
16               ta               NL, DE, FR
17               al               NL, FR, DE*
18               l#               NL, FR, DE*



[0017] The sequence 13+14+15+16+17+18 renders
the basis of the Dutch word 'fataal' (fatal), German 'fatal'
or French 'fatal'. For each language, both vowels get
their respective lengths adjusted. If required, the last two
diphones are stored another time for German only
(signalled by an asterisk *) with a somewhat 'thinner'
pronunciation of the 'l', because it is difficult to achieve
this with differentiated processing for this language.
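The time-domain processing qualifiers of examples 1 and 2 above may be sketched as follows. This is an illustration under stated assumptions only: the class `QualifiedDiphone`, the function `render`, the sample values and the marker positions are hypothetical, and combining a duration table and aspiration markers in one record is a choice made here for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class QualifiedDiphone:
    samples: list       # stored waveform, recorded with aspiration
    aspiration: tuple   # (start, end) marker indices of the aspiration sound
    duration_ms: dict = field(default_factory=dict)  # per-language duration


def render(unit, language, aspirated):
    """Return (samples, duration), using only the qualifiers of one language."""
    start, end = unit.aspiration
    if aspirated:
        samples = unit.samples
    else:
        # Forgo the aspiration by cutting out the marked stretch.
        samples = unit.samples[:start] + unit.samples[end:]
    return samples, unit.duration_ms[language]


# Hypothetical unit; the durations lie within the 200-80 ms range of example 1.
unit = QualifiedDiphone([1, 2, 3, 4, 5, 6], (1, 3), {"ENG": 200, "FR": 80})
print(render(unit, "FR", aspirated=False))
```

Only one waveform is stored; the per-language table and the markers decide at rendering time how it is lengthened, shortened or stripped of aspiration.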

Claims 

1. A method for preparing diphones for text-to-speech
generating by selectively pronouncing a set of
selected diphones, processing each such diphone for
persistent storage, and storing each processed
diphone in an individually addressable storage
location, characterized by amongst such set as spoken
by a single person on a homophony basis assigning
each diphone to one or more diverse languages for
diminishing storage requirements through sharing
of selective diphones amongst more than one
language.

2. A method as claimed in Claim 1, wherein certain
diphones are stored with processing qualifiers that 
discriminate processing operations on the diphone 
in question amongst such various languages shar- 
ing that diphone. 

3. A method for multi-language text-to-speech trans- 
forming by receiving said text as a string of diphone 
identifiers, read-accessing by each identifier a di- 
phone storage location, and converting each di- 
phone information so read out in a code-to-sound 
converter to an associated sound item, character- 
ized in that such diphones have been generated ac- 
cording to the method of Claims 1 or 2, and selective 
ones amongst said locations are each responsive 
to an associated plurality of diphone identifiers that 
are assigned to a plurality of languages. 

4. A method as claimed in Claim 3, furthermore accessing
certain diphones that are stored with
processing qualifiers for discriminating processing 
operations on the diphone in question amongst 
such various languages sharing that diphone, and 
wherein for speech rendering in a particular lan- 
guage only qualifiers for that particular language 
are taken into account. 

5. A device for multi-language text-to-speech trans- 
forming having input means for receiving said text 
as a string of diphone identifiers, read-access 
means for accessing by each identifier an associat- 
ed diphone storage location, and converting means 
fed by said storage means for converting each di- 
phone information so read out in a code-to-sound 
converter to an associated sound item, character- 
ized in that such diphones have been generated ac- 
cording to the method of Claims 1 or 2 and selective 
ones amongst said locations are each responsive 
to an associated plurality of diphone identifiers that 
are shared amongst a plurality of languages. 

6. A device as claimed in Claim 5, furthermore having 
processing means for accessing diphones that are 
stored with processing qualifiers for discriminating 
processing operations on the diphone in question 
amongst such various languages sharing that di- 
phone, and wherein for speech rendering in a par- 
ticular language such processing means is only re- 
sponsive to qualifiers for that particular language. 









FIG.1 



EP 786 132 






